A lexical database tool tailored for phonological
research is described. Database fields include
transcriptions, glosses and hyperlinks to speech files. Database
queries are expressed using HTML forms, and these 
permit regular expression search on any combination 
of fields. Regular expressions are passed directly to 
a Perl CGI program, enabling the full flexibility of 
Perl extended regular expressions. The regular expression
notation is extended to better support phonological
searches, such as search for minimal pairs. Search
results are presented in the form of HTML or LaTeX
tables, where each cell is either a number (representing
frequency) or a designated subset of the fields.
Tables have up to four dimensions, with an elegant 
system for specifying which fragments of which fields 
should be used for the row/column labels. The tool 
offers several advantages over traditional methods of 
analysis: (i) it supports a quantitative method of doing 
phonological research; (ii) it gives universal access 
to the same set of informants; (iii) it enables other 
researchers to hear the original speech data without 
having to rely on published transcriptions; (iv) it makes 
the full power of regular expression search available, 
and search results are full multimedia documents; and 
(v) it enables the early refutation of false hypotheses, 
shortening the analysis-hypothesis-test loop. A life- 
size application to an African tone language (Dschang) 
is used for exemplification throughout the paper. The 
database contains 2200 records, each with approxi- 
mately 15 fields. Running on a PC laptop with a stand- 
alone web server, the 'Dschang HyperLexicon' has 
already been used extensively in phonological field- 
work and analysis in Cameroon. 



Initial stages of phonological analysis typically focus 
on words in isolation, as the phonemic inventory and 
syllable canon are established. Data is stored as a 
lexicon, where each word is entered as a transcription 
accompanied by at least a gloss (so the word can be 
elicited again) and the major syntactic category. In 
managing a lexicon, the working phonologist has a 
variety of computational needs: storage and retrieval; 
searching and sorting; tabular reports on distributions 
and contrasts; updates to database and to reports as 
distinctions are discovered or discarded. In the past
the analyst had to do all this computation by hand
using index cards kept in shoeboxes. But now many of
these tasks are automated by software such as the SIL
programs Shoebox (Buseman et al., 1996) and Findphone
(Bevan, 1995),¹ or by commercial database packages.

Of course, many tasks other than those listed above
have already benefitted from (partial) automation.² Additionally,
it has been shown how a computational inheritance
model can be used for structuring lexical information
relevant for phonology (Reinhard & Gibbon,
1991). And there is a body of work on the use of finite
state devices - closely related to regular expressions
- for modelling phonological phenomena (Kaplan &
Kay, 1994) and for speech processing (cf. Kornai's



¹ Unlike regular database management systems, these include
international and phonetic character sets and user-defined
keystrokes for entering them, and a utility to dump a database into
an RTF file in a user-defined lexicon format for use in desktop
publishing.



² For example, see (Ellison, 1992; Lowe & Mazaudon, 1994;
Coleman, Dirksen, Hussain & Waals, 1996).
Figure 1: Format of Database Records



work with HMMs (Kornai, 1995)). However, computational
phonology is yet to provide tools for manipulating
lexical and speech data using the full expressive
power of the regular expression notation in a way that
supports pure phonological research.

This paper describes a lexical database system tailored
to the needs of phonological research and exemplified
for Dschang, a language of Cameroon. An
online lexicon (originally published as Bird & Tadadjeu,
1997) contains records with the format in Figure 1.
Only the most important fields are shown.

The user interface is provided by a Web browser. A
suite of Perl programs (Wall & Schwartz, 1991) generates
the search form in HTML and processes the query.
Regular expressions in the query are passed directly
to Perl, enabling the full flexibility of Perl extended
regular expressions. A further extension to the notation
allows searches for minimal sets, groups of words
which are minimally different according to some criterion.
Hits are structured into a tabular display and
returned as an HTML or LaTeX document.

In the next section, a sequence of example queries
is given to illustrate the format of queries and results,
and to demonstrate how a user might interact with the
system. A range of more powerful queries is then
demonstrated, along with an explanation of the notations
for minimal pairs and projections. Next, some
implementation details are given, and the component
modules are described in detail. The last two sections
describe planned future work and present the conclusions.

EXAMPLE 

This section shows how the system can be used to sup- 
port phonological analysis. The language data comes 
from Dschang, a Grassfields Bantu language of Camer- 
oon, and is structured into a lexicon consisting of 2200 
records. Suppose we wished to learn about phonotactic
constraints in the syllable rhyme. The following
sequence of queries was not artificially constructed,
but was issued in an actual session with the system
in the field, running the Web server in stand-alone
mode. The first query is displayed below.³



Search Attributes:

display:     count
root:        .*([$V])([$C])#
loanwords:   exclude
suffixed:    include
phrases:     exclude
time-limit:  2 minutes
vars:        $B = "\.#-";          # boundaries
             $S = "pbtdkgcj'";     # stops
             $F = "zsvfZS";        # fricatives
             $O = $S.$F;           # obstruents
             $N = "mnN";           # nasals
             $G = "wy";            # glides
             $C = $O.$N.$G."hl";   # cons
             $V = "ieaouEOU@";     # vowels



The main attribute of interest is the root attribute.⁴
The .* expression stands for a sequence of zero or
more segments. The expressions $V and $C are vari- 
ables defined in the vars section of the query form. 
These are strings, but when surrounded with brackets, 
as in [$V] and [$C], they function as wild cards 
which match a single element from the string. The 
# character is a boundary symbol marking the end of 
the root. Observe that the root attribute contains 
two parenthesised subexpressions. These will be called 
parameters and have a special role in structuring the 
search output. This is best demonstrated by way of an 
example. Consider the table below, which is the result 
of the above query. In this table, the row labels are all 



³ The display is only a crude approximation to the HTML form.
Note that the query form comes with the variables already filled in
so that it is not necessary for the user to supply them, although they
can be edited. The transcription symbols used in the system have
the following interpretation: U=ʉ, @=ə, E=ɛ, O=ɔ, N=ŋ, '=ʔ.

⁴ In the following discussion, 'attribute' refers to a line in the
query form while 'field' refers to part of a database record.
the segments which matched the variable $V, while the
column labels are just the segments that matched $C. 

Search Results: 
There are sufficient gaps in the table to make us wonder
if all the segments are actually phonemes. For example, 
consider o and u, given that they are phonetically very 
similar ([co] and [u] respectively). We can easily set 
up o as an allophone of u before k. Only the case of 
glottal stop needs to be considered. So we revise the 
form, replacing $V with just the vowels in question, 
and replacing the $C of the coda with apostrophe (for
glottal stop). We add a term for the syllable onset and
resubmit the query. See Figure 2. This time, several
attributes are omitted from the display for brevity. 

We can now conclude that o and u are in complementary
distribution, except for the five words corresponding
to pf and v onsets. But what are these words?
We revise the form again, further restricting the search 
string as follows: 



Search Attributes:

display:  speech word gloss
root:     .*(pf|v)[ou]'#


The display parameter is set to speech word gloss 
allowing us to see (and hear) the individual lexical 
items. The results are shown below. 



Search Results:

pf   [icon] lepfo'    mortar
     [icon] mpfu'     blood pact
v    [icon] mvo'      space in front of bed
     [icon] avu'      remainder
     [icon] levu'te   kitchen woodpile



The cells of the output table now contain fragments of
the lexical entries. The first part is an icon which, when
clicked, plays the speech file. The second part is a gif
of the orthographic form of the word. The third part
is the English gloss. Note that the above nouns have
different prefixes (e.g. le-, m-, a-). These are noun
class prefixes and are not part of the root field. If
we had wanted to take prefixes into consideration then
the as attribute, containing a transcription of the whole
word, could have been used instead.

Listening to the speech files it was found that the 
syllables pfo' and pfu' sounded exactly the same, as 
did vo' and vu'. The whole process up to this point 
had taken less than five minutes. After some quick 
informant work to recheck the data and hear the native- 
speaker intuitions, it was clear that the distinction bet- 
ween o and u in closed syllables was subphonemic. 

MORE POWERFUL QUERIES 
Constraining one field and displaying another 

In some situations we are not interested in seeing the 
field which was constrained, but another one instead. 
The next query displays the tone field for monosyllabic 
roots, classed into open and closed syllables. Although 
the root attribute is used in the query, the root field 
is not actually displayed. (This query makes use of a 
projection function which maps all consonants onto C 
and all vowels onto V, as will be explained later.) 



Search Attributes:

display:  tone
root:     #C+V(C?)# ($CV-proj)



The C+ expression denotes a sequence of one or more 
consonants, while C? denotes an optional coda conso- 
nant. By making C? into a parameter (using paren- 
theses) the search results will be presented in a two 
column table, one column for open syllables (with a 
null label) and one for closed syllables (labelled C). 
A minor change to the root attribute, enlarging the
scope of the parameter (#C+(VC?)#), will produce
the more satisfactory column labels V and VC.

Searching for near-minimal sets 

Finding good minimal sets is a heuristic process. No 
attempt has been made to encode heuristics into the 
system. Rather, the aim has been to permit flexible 
interaction between user and system as a collection 
of minimal sets is refined. To facilitate this process, 
the regular expression notation is extended slightly. 
Search Attributes:

display:  count
root:     .*([$C]+)([ou])'#
axes:     flip

Search Results:

Figure 2: Query to Probe the Phonemic Status of the o/u Contrast



Search Attributes:

display:  word gloss
root:     .*([$C]+){[ou]}([$C])#

Search Results:

pf   lepfo'    mortar
     mpfu'     blood pact
v    mvo'      space in front of bed
     avu'      remainder
     levu'ti   kitchen woodpile

Figure 3: Minimal Sets for o/u



Recall the way that parameters (parenthesised subex- 
pressions) allowed output to be structured. One of the 
parameters will be said to be in focus. Syntactically, 
this is expressed using braces instead of parentheses. 
Semantically, such a parameter becomes the focus of 
a search for minimal sets. Typically, this parameter
will contain a list of segments, such as {[ou]}, or
an optional segment whose presence is to be contrasted
with its absence, such as {h?}.

In order for a minimal set to be found, the parameter 
in focus must have more than one possible instantia- 
tion, while the other parameters remain unchanged. To 
see how this works, consider the following example. 
Suppose we wish to identify the minimal pairs for o/u
discussed above, but without having to specify glottal
stop in the query, as shown in Figure 3. Note this example
of a 3D table.



If this was not enough minimal pairs, we could relax 
the restrictions on the context. For example, if we do 
not wish to insist on the following consonant being 
identical across minimal pairs, we can remove the second
set of parentheses thus: .*([$C]+){[ou]}[$C]#.
This now gives minimal pairs like legok 'work' and
ŋgu' 'year'. Observe that the consonant preceding the
o/u vowel is fixed across the minimal pair, since this 
was still parenthesised in the query string. 

Usually, it is best for minimal pairs to have similar 
syntactic distribution. We can add a restriction that all 
minimal pairs must be drawn from the same syntactic 
category by making the whole part attribute into a 
parameter as follows. 



Search Attributes:

display:  tone
root:     .*([$C]+){[ou]}[$C]#
part:     (.*)



Making the part attribute into a parameter adds an 
extra dimension to the table of results. We now only 
see an o/u minimal pair if the other parameters agree. 
In other words, all minimal pairs that are reported 
will contain the same consonant cluster before the o/u 
vowel and will be from the same syntactic category. 
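The minimal-set semantics described above can be read as a grouping operation: entries are grouped by the values of the ordinary parameters, and a group is reported only if the parameter in focus takes more than one value within it. The following is a minimal sketch in Python rather than the Perl of the actual system. The function name `minimal_sets`, the use of an ordinary capture group in place of the brace notation, the explicit `#` at the start of the pattern and the toy root forms are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def minimal_sets(lexicon, pattern, focus):
    """Group matching roots by their non-focus parameters.

    lexicon: iterable of (root, gloss) pairs
    pattern: regex whose capture groups play the role of parameters
    focus:   1-based index of the group that is "in focus"
    Returns {context: {focus_value: [glosses]}}, keeping only contexts
    in which the focus group takes more than one distinct value.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for root, gloss in lexicon:
        m = re.fullmatch(pattern, root)
        if m is None:
            continue
        groups = m.groups()
        # The context is every parameter except the one in focus.
        context = groups[:focus - 1] + groups[focus:]
        cells[context][groups[focus - 1]].append(gloss)
    return {ctx: dict(vals) for ctx, vals in cells.items() if len(vals) > 1}

# Consonant symbols as in the vars section of the query form.
C = re.escape("pbtdkgcj'zsvfZSmnNwyhl")
PATTERN = rf"#([{C}]+)([ou])([{C}])#"

# Toy lexicon: root forms are guessed from the examples in the text.
LEXICON = [
    ("#pfo'#", "mortar"),
    ("#pfu'#", "blood pact"),
    ("#vo'#", "space in front of bed"),
    ("#vu'#", "remainder"),
    ("#kup#", "skin, bark"),
]

RESULT = minimal_sets(LEXICON, PATTERN, focus=2)
```

Adding a further parameter, such as the syntactic category, would simply lengthen the context tuple, so that only entries agreeing on that value as well would fall into the same group.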

Variables across attributes 

There are occasions where we need to have the same 
variable appearing in different attributes. For example, 
suppose we wanted to check where the southern dialect
and the principal dialect have identical vowels:⁵



⁵ Roots are virtually all monosyllabic, so there will usually be
a unique vowel sequence for the [$V]+ in the regular expression
to match with.
Search Attributes:

display:    root s_dialect
root:       .*(3[$V]+).*
s_dialect:  .*$3.*



Search Attributes:

display:  tone
root:     #C+V(C?)# ($CV-proj)
vars:     $CV-proj = {tr/$C/C/; tr/$V/V/;}



This query makes use of another syntactic extension 
to regular expressions. An arbitrary one-digit number 
which appears immediately inside a parameter allows 
the parameter to be referred to elsewhere. This means 
that whichever sequence of vowels matches [$V]+
in the root field must also appear somewhere in the 
s_dialect field. 
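The effect of a numbered parameter shared between two attributes can be approximated with an ordinary backreference, provided both fields sit in the same string. The sketch below is in Python rather than Perl and is only an illustration: the function name `same_vowels`, the two-field record layout, the vowel string and the sample forms are assumptions.

```python
import re

V = "ieaouEOU@"  # vowel symbols (assumed)

# The root and s_dialect fields are joined into one line so that a
# single backreference (\1) can tie the two fields together.
SAME_VOWELS = re.compile(rf"^[^;]*?([{V}]+)[^;]*;[^;]*\1[^;]*$")

def same_vowels(root, s_dialect):
    """True if some vowel sequence of the root recurs in s_dialect."""
    return SAME_VOWELS.match(f"{root};{s_dialect}") is not None
```

Like the query in the text, this sketch only requires that some vowel sequence matched in the root also occurs in the dialect form, which is adequate when roots contain a single vowel sequence.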

Negative restrictions 

The simplest kind of negative restriction is built using 
the set complement operator (the caret). However this 
only works for single character complements. A much
more powerful negation is available with the ?!
zero-width negative lookahead assertion, available in Perl 5,
which I will now discuss.

The next example uses the tone attribute. Dschang is 
a tone language, and the records in the lexicon include 
a field containing a tone melody. Tone melodies consist 
of the characters H (high), L (low), D (downstep) and 
F (fall). A single tone has the form D?[HL]F?, i.e. an
optional downstep, followed by H or L, followed by an 
optional fall. The next example finds all entries starting 
with a sequence of unlike tones. 

Search Attributes:

root:  .*(1[$T])(?!$1)[$T].*
vars:  $T = D?[HL]F?

The (1[$T]) expression matches any tone and sets
the $1 variable to the tone which was matched. The
(?!$1) expression requires that whatever follows the
first tone is different, and the final [$T] insists that
this same following material is a tone (rather than being
empty, for example).⁶
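The combination of a captured tone and a negative lookahead can be tried out directly with a standard regular expression engine. The following is a sketch in Python, not the system's Perl: the numbered-parameter notation is replaced by an ordinary capture group and backreference, and the function name `starts_with_unlike_tones` is an assumption.

```python
import re

TONE = r"D?[HL]F?"  # optional downstep, H or L, optional fall

# First tone is captured; the lookahead rejects a repeat of the same
# string at the next position; the final TONE demands a second tone.
UNLIKE_START = re.compile(rf"({TONE})(?!\1){TONE}")

def starts_with_unlike_tones(melody):
    """True if the melody begins with two tones that differ."""
    return UNLIKE_START.match(melody) is not None
```

With this encoding the melody HHF is rejected even though H and HF are distinct tones, because H is an initial substring of HF. This is the kind of case in which the choice of tone encoding matters.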

Projections 

I earlier introduced the notion of projections. In fact, 
the system allows the user to apply an arbitrary manip- 
ulation to any attribute before the matching is car- 
ried out. Here is the query again, this time with the
$CV-proj variable filled out.

⁶ Care must be taken to ensure that the alphabetic encodings of
distinct tones are sufficiently different from each other, so that one
is not an initial substring of another.



This causes the Perl tr (transliterate) function to be
applied to the root attribute before the #C+V(C?)#
regular expression is matched on this field.
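The projection step can be sketched as a character-level translation followed by a match on the projected string. The example below uses Python's `str.translate` in place of Perl's `tr`. The function name `syllable_shape`, the consonant and vowel strings and the sample roots are assumptions for illustration, and the enlarged parameter (VC?) is used so that the labels come out as V and VC.

```python
import re

C = "pbtdkgcj'zsvfZSmnNwyhl"  # consonant symbols (assumed)
V = "ieaouEOU@"               # vowel symbols (assumed)

# Map every consonant to "C" and every vowel to "V"; boundary
# symbols such as "#" are left untouched.
CV_PROJ = str.maketrans({**{c: "C" for c in C}, **{v: "V" for v in V}})

def syllable_shape(root):
    """Return 'V' or 'VC' for a monosyllabic root, else None."""
    m = re.fullmatch(r"#C+(VC?)#", root.translate(CV_PROJ))
    return m.group(1) if m else None
```

Because the match is carried out on the projected string, the pattern itself needs no character classes, which keeps the query short.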

Projections can also be used to simulate second order 
variables, such as required for place of articulation. 
Suppose that the language has three places of articu- 
lation: L (labial), A (alveolar) and V (velar). We are 
interested in finding any unassimilated sequences in the 
data (i.e. adjacent consonants with different places of 
articulation). The following query does just this. Prior 
to matching, the segments which have a place of artic- 
ulation value are projected to that value, again using 
tr. The query expression looks for a sequence of any
pair $P$P, where $P is a second order variable ranging
over places of articulation.

Search Attributes:

display:  word
root:     .*(5$P)(?!$5)($P).* ($P-proj)
vars:     $P-proj = tr/pbmtdnkgN/LLLAAAVVV/;
          $P = [LAV];

Observe that the second $P must be different from 
the first, because of the zero-width negative lookahead 
assertion (?!$5). This states that immediately to
the right of this position one does not find an instance 
of $5, where this variable is the place of articulation 
found in the first position. The output of the query is a 
3x3 table showing all words that contain unassimilated 
consonant sequences. 
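The place-of-articulation query can be sketched in the same way: project first, then look for two adjacent place symbols that differ. This is a Python illustration rather than the system's Perl. The function name `unassimilated` and the sample forms are hypothetical.

```python
import re

# Same mapping as tr/pbmtdnkgN/LLLAAAVVV/ in the query above.
P_PROJ = str.maketrans("pbmtdnkgN", "LLLAAAVVV")

# A place symbol, not followed by itself, followed by a place symbol.
UNASSIMILATED = re.compile(r"([LAV])(?!\1)([LAV])")

def unassimilated(root):
    """Return the first pair of unlike adjacent places, or None."""
    m = UNASSIMILATED.search(root.translate(P_PROJ))
    return m.groups() if m else None
```

The two captured groups correspond to the two axes of the 3x3 table. A projection of this kind assumes that the letters L, A and V are not otherwise used as transcription symbols.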

SYSTEM OVERVIEW 
Lexicon compiler 

The base lexicon is in Shoebox format, in which the 
fields are not required to be in a fixed order. To save 
on runtime processing, a preprocessing step is applied 
to each field. For example, the contents of the \w 
field, comprising characters from the Cameroon character
set, are replaced by a pointer to a graphics file for
the word (i.e. a URL referencing a gif).⁷ Each record
is processed into a single line, where fields occur in a 

⁷ These gifs were generated using LaTeX along with the utilities
pstogif and giftool.
canonical order and a field separator is inserted, and
the compiled lexicon is stored as a DBM file for rapid 
loading. 
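The compilation step might look roughly as follows. This is a sketch in Python under several assumptions: the field markers in `FIELD_ORDER`, the rule that derives the gif name from the `\w` value and the function name `compile_record` are all hypothetical, and storage in a DBM file is left out.

```python
# Hypothetical canonical order of Shoebox field markers.
FIELD_ORDER = ["id", "w", "as", "rt", "t", "ps", "en", "fr"]

def compile_record(shoebox_text, sep=";"):
    """Flatten one Shoebox record (lines of the form '\\marker value')
    into a single line with fields in a fixed order."""
    fields = {}
    for line in shoebox_text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            fields[marker] = value.strip()
    # Replace the word itself by a pointer to its image file
    # (the naming scheme here is an assumption).
    if "w" in fields:
        fields["w"] = '<img src="%s.gif">' % fields["w"]
    # Missing fields become empty strings so positions stay fixed.
    return sep.join(fields.get(m, "") for m in FIELD_ORDER)
```

Keeping every field at a fixed position is what later allows a single regular expression to address fields by counting separators.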

The query string 

The search attributes in the query form can contain 
arbitrary Perl V5 regular expressions, along with some 
extensions introduced above. A CGI program con- 
structs a query string based on the submitted form data. 
The query string is padded with wild cards for those 
fields which were not restricted in the query form. 

The dimensionality of the output and the axis labels 
are determined by the appearance of 'parameters' in the 
search attributes. These parenthesised subexpressions 
are copied directly into the query string. So, for example,
the first query above contained the search expression
.*([$V])([$C])# applied to the root field. This
field occupies fifth position in the compiled version of
a record, and so the search string is as follows. The
variable $e matches any sequence of characters not
containing the field separator.

$search = /^$e;$e;$e;$e;.*([$V])([$C])#;$e;$e;$e;$e;$e;$e;$e;$e$/
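The padding of the query string can be sketched as follows, in Python rather than Perl. The names `build_search`, `ANY_FIELD`, `N_FIELDS` and `ROOT` are assumptions, as is the record width of thirteen fields, which is read off the search string above. The sample record is a simplified version of the entry discussed later.

```python
import re

SEP = ";"
ANY_FIELD = f"[^{SEP}]*"  # plays the role of $e
N_FIELDS = 13             # assumed record width
ROOT = 4                  # root is the fifth field (0-based index 4)

def build_search(constraints, n_fields=N_FIELDS):
    """constraints maps 0-based field positions to regex fragments;
    every other field is padded with a wild card."""
    parts = [constraints.get(i, ANY_FIELD) for i in range(n_fields)]
    return re.compile("^" + SEP.join(parts) + "$")

V = "ieaouEOU@"
C = "pbtdkgcj'zsvfZSmnNwyhl"
search = build_search({ROOT: f".*([{V}])([{C}])#"})

record = ('0107;;<img src="akup.gif">;#a.kup#;#kup#;LL;;'
          "*kub';n;7/6,8;skin, bark;peau, ecorce;")
```

Because the wild card cannot match the separator and the expression is anchored at both ends, the `.*` inside the root constraint cannot spill into a neighbouring field without leaving too few separators for the remaining wild cards.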

The search loop 

Search involves a linear pass over the whole lexicon
%LEX.⁸ The parameters contained in $search are
tied to the variables $1 - $4. These are stored in four
associative arrays $dim1 - $dim4 to be used later as
axis labels.

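foreach $entry (keys %LEX) {
    if ($LEX{$entry} =~ /$search/) {
        # count each parameter value for later use as an axis label
        $dim1{$1}++;
        $dim2{$2}++;
        $dim3{$3}++;
        $dim4{$4}++;
        # append the key of this entry to the matching cell
        $hits{"$1;$2;$3;$4"} .= ";".$entry;
    }
}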

Finally, a pointer to the entry is stored in the 4D 
array $hits (appended to any existing hits in that 
cell.) Here we see that the structuring of the output 
table using parameters is virtually transparent, with 
Perl itself doing the necessary housekeeping. 

As an example, suppose that the following lexical 
entry is being considered at the top of the above loop: 

⁸ Inverting on individual fields was avoided because of the run-
time overheads and the fact that this prevents variable instantiation
across fields.



$entry = 0107
$LEX{$entry} =
    0107;;<img src="akup.gif">;
    #a.kup#;#kup#;LL;;*k^ub';n;7/6,8;
    skin, bark;peau, \'ecorce;

By matching this against the query string given in our
first example we end up matching .*([$V])([$C])#
with #kup#. This results in $1=u and $2=p. The
entries $dim1{u} and $dim2{p} are incremented,
recording these values for later use in the $V and
$C axes respectively. Finally $hits{"u;p;;"} is
updated with the index 0107.
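The bookkeeping in this worked example can be reproduced with a short Python sketch. It is not the system's Perl: the function name `search_lexicon`, the use of `Counter` objects for the axis labels, tuples as cell keys and the one-field toy lexicon are assumptions.

```python
import re
from collections import Counter, defaultdict

def search_lexicon(lexicon, pattern, n_dims=4):
    """Linear pass over the lexicon, recording axis labels and hits."""
    dims = [Counter() for _ in range(n_dims)]
    hits = defaultdict(list)
    for key, line in lexicon.items():
        m = pattern.search(line)
        if m is None:
            continue
        # Parameters that are absent from the pattern get an empty
        # label, so every hit has exactly n_dims coordinates.
        labels = (m.groups(default="") + ("",) * n_dims)[:n_dims]
        for dim, label in zip(dims, labels):
            dim[label] += 1
        hits[labels].append(key)
    return dims, hits

V = "ieaouEOU@"
C = "pbtdkgcj'zsvfZSmnNwyhl"
PATTERN = re.compile(rf".*([{V}])([{C}])#")
DIMS, HITS = search_lexicon({"0107": "#kup#"}, PATTERN)
```

Here `DIMS[0]` and `DIMS[1]` hold the labels for the $V and $C axes, and the single hit is filed under the cell with coordinates u, p and two empty labels.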

The display loop 

This module cycles through the axis labels that were
stored in $dim1 - $dim4 and combines them to access
the $hits array. At each level of nesting, code is
generated for the HTML or LaTeX table output. At the
innermost level, the fields selected by the user in the
display attribute are used to build the current cell.
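For the two-dimensional case, the display step might be sketched as follows in Python. The function name `render_table`, the restriction to two axes and the default of showing a count in each cell are assumptions; the actual module handles up to four dimensions and user-selected display fields.

```python
from html import escape

def render_table(hits, display=len):
    """Render hits keyed by (row, column) as an HTML table.

    display is applied to the list of entries in each cell; the
    default, len, gives a frequency count.
    """
    rows = sorted({r for r, _ in hits})
    cols = sorted({c for _, c in hits})
    header = "".join(f"<th>{escape(c)}</th>" for c in cols)
    out = ["<table>", f"<tr><th></th>{header}</tr>"]
    for r in rows:
        cells = "".join(
            f"<td>{escape(str(display(hits.get((r, c), []))))}</td>"
            for c in cols
        )
        out.append(f"<tr><th>{escape(r)}</th>{cells}</tr>")
    out.append("</table>")
    return "\n".join(out)

TABLE = render_table({("u", "p"): ["0107"], ("o", "k"): ["0001", "0002"]})
```

Passing a different `display` function, for example one that looks up the word and gloss of each entry, would fill the cells with fragments of the lexical entries instead of counts.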

FUTURE WORK 

A number of extensions to the system are planned. 
Since Dschang is a tone language, it would be partic- 
ularly valuable to have access to the pitch contours of 
each word. These will eventually be displayed as small 
gifs, attached to the lexical entries. 

Another extension would be to permit updates to the 
lexicon through a forms interface. A special instance 
of the search form could be used to validate existing 
and new entries, alerting the user to any data which 
contradicts current hypotheses. 

The regular expression notation is sometimes cum- 
bersome and opaque. It would be useful to have a 
higher level language as well. One possibility is the 
notation of autosegmental phonology, which can be 
compiled into finite-state automata (Bird & Ellison,
1994). The graphics capabilities for this could be provided
on the client side by a Java program.

A final extension, dependent on developments with 
HTML itself, would be to provide better support for spe- 
cial characters and user-definable keystrokes for access- 
ing them. 

CONCLUSION 

This paper has presented a hypertext lexicon tailored to 
the practical needs of the phonologist working on large 
scale data problems. The user accesses the lexicon via 
a forms interface provided by HTML and a browser. A 
CGI program processes the query. The user can refine a
query during the course of several interactions with the
system, finally switching the output to LaTeX format for
direct inclusion of the results in a research paper. An
extension to the regular expression notation was used
for searching for minimal pairs. Parenthesised subexpressions
are interpreted as parameters which control
the structuring of search results. These extensions,
though intuitively simple, make a lot of expressive
power available to the user. The current prototype system
has been used heavily for substantive phonological
fieldwork and analysis in the field, documented in
(Bird, 1997). There are a number of ensuing benefits of
this approach for phonological research: (i) it supports 
a quantitative method of doing phonological research; 
(ii) it gives universal access to the same set of infor- 
mants; (iii) it enables other researchers to hear the orig- 
inal speech data without having to rely on published 
transcriptions; (iv) it makes the full power of regu- 
lar expression search available, and search results are 
full multimedia documents; and (v) it enables the early 
refutation of false hypotheses, shortening the analysis- 
hypothesis-test loop. 